INTRODUCTION


Some  of  the  most  important  and  interesting  information 
processing  tasks  that  men  and  women  in  the  real  world  face 
are  essentially  hierarchical  (or  cascaded)  in  nature  and 
involve  different  people  with  different  types  of  information 
that  must  be  combined.  In  a  large  class  of  such 
hierarchical  tasks  a  distinction  is  made  between  the 
diagnostic  impact  of  a  piece  of  information  and  its 
reliability.  In  a  legal  context,  for  example,  a  witness  may 
testify  about  having  seen  the  defendant  on  the  scene  of  the 
crime.  This  piece  of  information  might  be  relevant  to 
whether  or  not  the  defendant  committed  the  crime.  In  order 
to  degrade  the  diagnostic  impact  of  the  piece  of 
information,  the  defense  attorney  may  try  to  establish  that 
the  witness'  testimony  is  unreliable.  Schum  (Notes  1  &  2) 
has  formalized  hierarchical  structures  like  these  and  a 
substantial  empirical  literature  exists  on  how  people 
perform in simple cascaded inference tasks (e.g., Peterson,
1973).

The  normative  formula  for  calculating  posterior  odds  (in 
terms  of  the  probability  forms  of  reliability  (R)  and 
diagnosticity  (D))  assuming  equal  priors  and  symmetric 
reliability  and  diagnosticity  is: 

$$\text{Posterior Odds} = \frac{2RD - R - D + 1}{-2RD + D + R}$$


This  assumes  equal  priors,  symmetric  reliability,  and 
symmetric  diagnosticity. 
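
As a quick numerical check on the formula above, the following is a minimal sketch in Python. The function names are illustrative assumptions and do not come from the original study. Because the numerator and denominator sum to one, the numerator is itself the posterior probability.

```python
def posterior_odds(r: float, d: float) -> float:
    """Posterior odds for reliability r and diagnosticity d (both probabilities).

    Assumes equal priors and symmetric reliability and diagnosticity,
    as stated in the text.
    """
    numerator = 2 * r * d - r - d + 1    # equals r*d + (1 - r)*(1 - d)
    denominator = -2 * r * d + d + r     # equals r*(1 - d) + (1 - r)*d
    return numerator / denominator


def posterior_probability(r: float, d: float) -> float:
    """Convert the posterior odds to a probability."""
    odds = posterior_odds(r, d)
    return odds / (1 + odds)


# A perfectly reliable report (r = 1) leaves the diagnosticity untouched,
# and a report at chance (r = .5) carries no information (odds of 1).
print(posterior_odds(1.0, 0.75))
print(posterior_odds(0.5, 0.75))
print(round(posterior_probability(0.9, 0.9), 3))
```

With R = .9 and D = .9 the sketch gives a posterior probability of about .82, which agrees with the first row of Table 1.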

Most  of  the  empirical  research  on  hierarchical  inference 
had  a  single  person  process  both  diagnosticity  and 
reliability  information.  Typically,  these  experiments  have 
found  that  subjects  were  radical  in  comparison  to  the 
normative  responses  specified  by  Modified  Bayes  Theorem 
(MBT)  (Gettys  and  Willke,  1969).  That  is,  the  unreliability 
of  the  report  of  the  diagnostic  event  was  taken  into 
account,  but  not  enough.  In  contrast  to  this  single  person 
paradigm,  many  real  world  inference  problems  involve 
different  experts,  some  with  diagnosticity  expertise  and 
some  with  reliability  expertise.  It  is  at  least 
questionable  whether  the  single  person  results  would 
generalize  to  the  multiple  expert  inferences.  Specifically, 
one  might  expect  the  reliability  information  to  be  taken 
into  account  more  strongly  when  the  use  of  the  information 
is  advocated  by  a  person  with  special  expertise  in 
reliability. 

The present experiment examined this hypothesis by
comparing the relative impact of reliability and
diagnosticity in a single vs. two person paradigm, in which
one person had reliability information and the other had
diagnosticity information. We also used a power
manipulation in the two person groups, which gave the ability
to make the final decision on the response to either the
reliability or the diagnosticity "expert". We felt that
this power manipulation would affect the relative impact of
the reliability and diagnosticity information.

The  early  experiments  in  cascaded  inference  usually  used 
a  paradigm  in  which  subjects  performed  a  sampling  task  or  a 
difficult  perceptual  task.  These  laboratory  tasks  do  not 
parallel  any  real  world  tasks.  Furthermore,  subjects  must 
first  make  a  judgement  about  the  probability  of  the  event 
and  then  aggregate  that  judgement  with  the  diagnostic  impact 
of the event. Schum's normative models closely parallel
complex  real-world  hierarchical  tasks,  but  do  not  provide 
the  specific  correct  inputs  into  the  models,  so  a  numerical 
comparison  between  subjects'  responses  and  the  normative 
responses  is  difficult  to  make. 

In  this  study  we  asked  subjects  to  role  play  the  parts 
of  personnel  officers  evaluating  job  applicants  for 
positions  as  electronics  repairpersons. 

Subjects received training in the interpretation of
diagnosticity, reliability, or both, and saw probabilities
and likelihood ratios representing their information. The
fairly  detailed,  pilot  tested  scenario  provided  a  setting  in 
which  to  present  pass/fail  test  scores  for  the  hypothetical 
job  applicants  that  had  a  specific  diagnostic  impact  on  the 
probability  of  success  on  the  job,  along  with  reliability 
information  on  the  pass/fail  score.  The  goals  of  the  design 
were  as  far  as  possible  to  combine  an  interesting  task  with 
availability  of  a  normatively  correct  solution,  and  to  focus 
attention on the hierarchical nature of the task.

METHOD 

Subjects 

Subjects  were  41  male  and  31  female  undergraduates 
enrolled  in  an  introductory  psychology  course  at  the 
University of Southern California. Subjects signed up for
and participated in several hours of research in partial
fulfillment of class requirements.

Materials  and  Stimuli 

The  scenario  included  some  training  in  the  use  of 
probability  and  odds.  Depending  on  the  condition  to  which 
subjects  were  assigned  the  instructions  included  information 
about  the  meaning  of  reliability,  diagnosticity,  or  both. 

We  presented  the  diagnosticities  and  reliabilities  on 
forms  that  contained  relative  frequencies,  probabilities, 
and odds. Subjects responded in odds on a logarithmically
spaced response scale (log-odds scale). The scale was
symmetric around "1 : 1" with endpoints of "1000 : 1" and
"1 : 1000".

Procedure 

Subjects were randomly assigned to one of four
experimental conditions. In condition 1, subjects worked
alone and received both the reliability and diagnosticity
information. In condition 2, subjects worked in pairs. One
member of each pair received diagnosticity information and
training; the other received similar information and training
about reliability. The pair was asked to reach a consensus
judgement. Conditions 3 and 4 were identical to condition 2
except that either the diagnosticity expert (condition 3) or
the reliability expert (condition 4) was given the power to
overrule the other subject on the final response.

The  experimenter  assigned  subjects  to  their  roles  with 
a  flip  of  a  coin,  in  their  sight.  The  experimenter  then 
read  aloud  the  general  set  of  instructions  to  the  subjects 
while they had a copy of them. General instructions
contained a scenario description, a summary of the type of
information to be provided, a description of how bonuses
were to be awarded (using a modification of the quadratic
scoring rule), and a brief lesson on the use of probability
and odds.

The experimenter encouraged subjects to ask questions
when confused and quizzed them on key points.

The  experimenter  then  read  and  discussed  the 
instructions  on  the  diagnosticity  information  (to  the 
diagnosticity expert alone in the two person conditions).
These instructions contained information about the
probabilistic meaning of diagnosticity, the presentation of
the diagnosticities, and how the diagnosticities would vary.
Next, the experimenter presented the reliability instructions
(separately to the reliability expert in the two person
conditions), answering the same set of questions about the
reliabilities. Covering the full set of instructions took
about 30 to 75 minutes depending on how much help subjects
needed.


Subjects then worked through two practice trials.
During practice trials, subjects were free to ask questions.
After each practice trial, the experimenter questioned the
subjects to determine if they understood the meaning of
their response, in terms of the relative chance of
success/failure of the hypothetical job applicant.

Occasionally  subjects  revised  their  odds  in  the 
incorrect  direction.  We  assumed  this  indicated  that 
subjects  did  not  fully  understand  the  task.  If  this 
occurred  during  the  practice  trials,  the  experimenter 
encouraged subjects to rethink their responses, but they
were not required to change them. If it occurred during the
data collection trials, the meaning of the response was
explained again. This happened rarely.

The  subjects  then  ran  the  12  data  collection  trials. 
The  experimenter  scored  a  subset  of  eight  of  the  trials  and 
paid  each  subject  a  bonus  of  three  to  five  dollars,  awarded 
according  to  their  performance  as  scored  by  a  modified  form 
of  the  quadratic  scoring  rule. 

The  same  experimenter  ran  all  subjects. 

RESULTS 

Design 

The  between  groups  variable  was  the  four  conditions 
described previously. Ten full sets of responses were
collected in each condition (except in condition 4, where 11
full sets were collected by mistake). Therefore, there were
10 subjects in condition 1, 20 in each of conditions 2 and 3,
and 22 in condition 4.

Within  subjects  both  reliability  and  diagnosticity 
varied  at  six  levels.  The  levels  were  .60,  .67,  .75,  .80, 
.90,  and  .95.  However,  the  six  levels  of  each  were  not 
fully  crossed,  but  grouped  as  three  sets  of  2  x  2's  thus 
yielding  the  12  trials.  Within  any  one  of  the  2  x  2's,  the 
cells  off  the  diagonal  yielded  the  same  final  normative 
probability.  The  three  2  x  2's  and  the  normative  final 
probabilities  are  presented  in  Table  1. 
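
The design described above can be sketched in a few lines of Python. This is an illustration only: the pairing of levels into the three 2 x 2's is read off Table 1, the names `LEVEL_PAIRS`, `normative_probability`, and `build_trials` are hypothetical, and the probability is computed as RD + (1 - R)(1 - D), which is the numerator of the normative odds formula given in the Introduction.

```python
from itertools import product

# The two levels used within each of the three 2 x 2's (taken from Table 1).
LEVEL_PAIRS = [(0.90, 0.75), (0.95, 0.60), (0.80, 0.67)]


def normative_probability(r: float, d: float) -> float:
    """Normative final probability under equal priors and symmetry."""
    return r * d + (1 - r) * (1 - d)


def build_trials():
    """Return the 12 (reliability, diagnosticity, probability) trials."""
    trials = []
    for pair in LEVEL_PAIRS:
        # Cross the two levels with themselves: 4 cells per 2 x 2.
        for r, d in product(pair, repeat=2):
            trials.append((r, d, round(normative_probability(r, d), 3)))
    return trials


for r, d, p in build_trials():
    print(f"R = {r:.2f}   D = {d:.2f}   P = {p:.3f}")
```

Because the formula is symmetric in R and D, the two off-diagonal cells of each 2 x 2 necessarily share the same normative probability, as the text notes.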


Insert  Table  1  about  here. 


Analyses 

ANOVAs. Preliminary ANOVAs were run on "reflected"
log-likelihood ratios. (Normative odds were both greater
and less than unity. For the purpose of analysis, on trials
in which the normative result was less than unity, the
reciprocal of the raw response and of the normative response
was taken before being logged. This applies to all analyses
unless otherwise noted.) A separate ANOVA was run on each of
the 2 x 2's. The F ratio for the between subjects
manipulation did not approach a level of classical
significance. This was unexpected, since one of the
differences between conditions is that condition 1 had only
single subjects working alone whereas the other conditions
had two subjects working together.
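
The reflection step described above can be illustrated with a short sketch. The function name `reflected_log_odds` is hypothetical, and the use of base-10 logarithms is an assumption, since the report does not state the base.

```python
import math


def reflected_log_odds(response_odds: float, normative_odds: float):
    """Return (log response, log normative) after reflection.

    When the normative odds are below unity, both the response and the
    normative value are replaced by their reciprocals before logging, so
    the normative log-odds are never negative. Base 10 is an assumption.
    """
    if normative_odds < 1:
        response_odds = 1 / response_odds
        normative_odds = 1 / normative_odds
    return math.log10(response_odds), math.log10(normative_odds)


# A response of 1:4 on a trial whose normative odds are 1:2 is treated
# exactly like a response of 4:1 on a trial whose normative odds are 2:1.
print(reflected_log_odds(0.25, 0.5))
print(reflected_log_odds(4.0, 2.0))
```

A response on the wrong side of 1 : 1 remains negative after reflection, so revisions in the incorrect direction stay visible in the analysis.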
TABLE 1

Level of Reliability, Diagnosticity, and Normative Probability

Reliability    Diagnosticity    Normative Probability

    .9              .9                  .820
    .9              .75                 .700
    .75             .9                  .700
    .75             .75                 .625

    .95             .95                 .905
    .95             .6                  .590
    .6              .95                 .590
    .6              .6                  .520

    .8              .8                  .680
    .8              .67                 .602
    .67             .8                  .602
    .67             .67                 .558
For the within subject variables (diagnosticity and
reliability), diagnosticity was highly significant (P <
.0001) in all three of the 2 x 2's. Reliability was
significant in two of the 2 x 2's (P < .0001 and P = .002).

An interaction of reliability and diagnosticity was
significant in one (P = .0104). The mean log-likelihood
ratios are plotted in Figure 1, collapsing over condition.



Insert  Figure  1  about  here. 
The general picture that emerges is that diagnosticity
has a large effect, reliability has a smaller effect, and
possibly there is an interaction which indicates that
reliability has a greater effect at the higher levels of
diagnosticity.

For  closer  examination  of  the  data  we  used  individual 
correlational  and  regression  analyses. 

Correlational and regression analyses. Table 2 shows
correlational analyses on a subject by subject basis.


Insert  Table  2  about  here 


Figure 1: Mean Log-Odds Responses for Varying Levels
of Reliability and Diagnosticity
(Three panels; vertical axis: log-odds response; horizontal
axis: diagnosticity; separate curves for the two reliability
levels in each panel.)

TABLE 2

Correlation Means and Standard Deviations for Logs of Responses
with Logs of Normative Result and Various Heuristics

All of these analyses were performed on the logs of the
likelihood ratios obtained in the following ways and the logs
of the odds that the subjects gave. Row 1 used the full-range
normative likelihood ratios (and responses); row 2 used
responses and normative likelihood ratios that are reflected
so that all are greater than or equal to unity. All of the
rest of the rows also used reflected responses and
likelihood ratios. Rows 3 and 4 used, respectively, the
likelihood ratios obtained just by using the reliability or
diagnosticity information. Row 6 used reliability times
diagnosticity, both in probability form, and then converted
to likelihood ratio form. Row 7 multiplies reliability in
probability form times the likelihood ratio form of
diagnosticity; and row 8 uses the likelihood ratio of
diagnosticity raised to a power equal to the probability
form of the reliability. (Formulas are in Table 2.) As in
the ANOVA, no interpretable differences among conditions
were found.
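
The predictors described in prose above can be written out as a small sketch. The formulas below are a reading of the verbal descriptions of rows 3, 4, 6, 7, and 8 only; the formulas printed in Table 2 are not reproduced in this text, so the exact forms, the function names, and the dictionary keys are assumptions.

```python
def odds(p: float) -> float:
    """Likelihood-ratio (odds) form of a probability."""
    return p / (1 - p)


def heuristic_predictions(r: float, d: float) -> dict:
    """Candidate predictors for reliability r and diagnosticity d.

    Keys name the Table 2 row each value is meant to correspond to.
    """
    return {
        "row3": odds(r),        # reliability alone, L(R)
        "row4": odds(d),        # diagnosticity alone, L(D)
        "row6": odds(r * d),    # P(R) * P(D), converted to odds
        "row7": r * odds(d),    # P(R) * L(D)
        "row8": odds(d) ** r,   # L(D) raised to the power P(R)
    }


for name, value in heuristic_predictions(0.9, 0.75).items():
    print(f"{name}: {value:.3f}")
```

Each value would be logged before being correlated with the logged responses, as in the rest of the analyses.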

Table  3  presents  sample  statistics  computed  on  the 
slopes  obtained  using  the  log  of  the  reflected  obtained 
likelihood  ratios  as  the  dependent  variable  and  the  log  of 
the  normative  reflected  likelihood  ratios  as  the  independent 
variable. 
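
A minimal sketch of how one such slope could be computed for a single subject is given below. It assumes ordinary least squares with an intercept and base-10 logarithms, neither of which is stated in the report. The response values are hypothetical and are not data from the study.

```python
import math


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Normative probabilities for one 2 x 2 of Table 1, converted to log-odds.
normative_p = [0.820, 0.700, 0.700, 0.625]
x = [math.log10(p / (1 - p)) for p in normative_p]

# Hypothetical odds responses for one subject (illustration only).
responses = [6.0, 2.5, 3.0, 1.8]
y = [math.log10(o) for o in responses]

slope = ols_slope(x, y)
print(f"slope = {slope:.2f}")
```

A slope above 1 means the logged responses change faster than the normative log-odds, which is the sense in which responses are called radical in this report.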


Insert  Table  3  about  here. 


TABLE 3

Sample Statistics for Slopes
of Log Responses with Log Normative Odds

                      Condition 1  Condition 2  Condition 3  Condition 4

Median                   1.15         1.13         1.23         1.17
Mean                     1.0990       1.0508       1.2330       1.2287
Standard Deviation        .195         .275         .142         .327

Again, no interpretable differences between conditions are
found. The high correlations with the normative responses
and the slightly greater than unity betas of Table 3
indicate that subjects do a generally good job on the
hierarchical task. The betas indicate that subjects are
somewhat radical in their probability estimates, which is
essentially the typical finding. Figure 2 plots the mean log
likelihood ratio for each of the 12 trials against the
normatively correct log likelihood ratio, collapsing across
individuals and conditions.


Insert  Figure  2  about  here. 


This  also  demonstrates  the  general  pattern  of  radicalness. 

Rows  3  and  4  of  Table  2  (the  correlations  of  obtained 
log  likelihood  ratios  with  those  obtained  from  the 
reliability  and  diagnosticity  levels  used  in  the  experiment) 
yield  correlations  which  indicate  that  subjects  relied  more 
heavily  on  diagnosticity  than  reliability  in  making  their 
inferences. (An examination of the normative formula
provided earlier shows that each should be used equally in
making the inferences.) Again, there were no interpretable
between-condition differences.

The  last  four  rows  of  Table  2  briefly  examine  four 
possible  heuristics  of  which  the  likelihood  form  of 
diagnosticity,  L(D),  raised  to  the  probability  form  of 
reliability,  P(R),  provides  the  best  fit  to  the  data. 
However,  examination  of  the  individual  correlations  (which 
are  not  provided  in  this  report)  shows  individual 
differences  in  what  the  subjects  seemed  to  be  doing. 

Figure 2: Mean Log-Odds Responses Plotted on Normative
Log-Odds Results

DISCUSSION

This study has two main conclusions: that reliability
information in general is not used appropriately, which leads
to radicalism in probability estimates, and that different
individuals working together, each receiving different types
of information, respond much the same as would an individual
working alone.

The former conclusion is a replication of the standard
finding in cascaded inference (e.g., see Peterson et al.,
1973). Schum, Du Charme, and DePitts (1973) suggest a
possible  explanation:  it  is  non-obvious  how  unreliability 
degrades  likelihood  ratios,  especially  when  they  are  large. 
Panel  A  of  Figure  3  demonstrates  this  normative  phenomenon. 


Insert  Figure  3  about  here. 


However, if instead of looking at adjusted likelihood ratios
we examine adjusted conditional probabilities, the
non-linearity of the appropriate degradation disappears
(Panel B). This suggests that a log-odds scale may not be
the most appropriate response mode for cascaded tasks.
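
The contrast between the two panels can be checked numerically. The sketch below assumes that the comparison is made across reliability levels at a fixed diagnosticity; the function names and the chosen levels are illustrative assumptions. Under the normative formula the final probability is RD + (1 - R)(1 - D), which equals (2D - 1)R + (1 - D) and is therefore linear in R, while the final odds are not.

```python
def final_probability(r: float, d: float) -> float:
    """Normative final probability (equal priors, symmetric R and D)."""
    return r * d + (1 - r) * (1 - d)


def final_odds(r: float, d: float) -> float:
    """Normative final odds corresponding to final_probability."""
    p = final_probability(r, d)
    return p / (1 - p)


d = 0.95                                   # fixed, highly diagnostic datum
reliabilities = [0.6, 0.7, 0.8, 0.9, 1.0]  # equally spaced levels

probs = [final_probability(r, d) for r in reliabilities]
odds_values = [final_odds(r, d) for r in reliabilities]

# Change produced by each equal step in reliability.
prob_steps = [b - a for a, b in zip(probs, probs[1:])]
odds_steps = [b - a for a, b in zip(odds_values, odds_values[1:])]

print("probability steps:", [round(s, 3) for s in prob_steps])
print("odds steps:       ", [round(s, 3) for s in odds_steps])
```

Each equal step in reliability moves the final probability by the same amount, whereas the corresponding steps in final odds grow rapidly as reliability approaches 1.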


Figure 3: Comparison of Normative Final Odds vs. Normative
Final Probability for Hierarchical Inference
(Panel A: final odds; Panel B: final probability.)